Abstract

We study the scalability of consensus-based distributed optimization algorithms by considering two questions: How many 
processors should we use for a given problem, and how often should they communicate when communication is not free? Central 
to our analysis is a problem-specific value r which quantifies the communication/computation tradeoff. We show that organizing 
the communication among nodes as a k-regular expander graph [1] yields speedups, while when all pairs of nodes communicate 
(as in a complete graph), there is an optimal number of processors that depends on r. Surprisingly, a speedup can be obtained, 
in terms of the time to reach a fixed level of accuracy, by communicating less and less frequently as the computation progresses. 
Experiments on a real cluster solving metric learning and non-smooth convex minimization tasks demonstrate strong agreement 
between theory and practice. 

I. Introduction 

How many processors should we use and how often should they communicate for large-scale distributed optimization? We 
address these questions by studying the performance and limitations of a class of distributed algorithms that solve the general 
optimization problem 

$$\underset{x \in X}{\text{minimize}} \quad F(x) = \frac{1}{m}\sum_{j=1}^{m} l_j(x) \tag{1}$$

where each function $l_j(x)$ is convex over a convex set $X \subseteq \mathbb{R}^d$. This formulation applies widely in machine learning scenarios, 
where $l_j(x)$ measures the loss of model x with respect to data point j, and F(x) is the cumulative loss over all m data points. 

Although efficient serial algorithms exist [2], the increasing size of available data and problem dimensionality are pushing 
computers to their limits and the need for parallelization arises [3]. Among many proposed distributed approaches for solving 
(1), we focus on consensus-based distributed optimization [4], [5], [6], [7] where each component function in (1) is assigned 
to a different node in a network (i.e., the data is partitioned among the nodes), and the nodes interleave local gradient-based 
optimization updates with communication using a consensus protocol to collectively converge to a minimizer of F(x). 

Consensus-based algorithms are attractive because they make distributed optimization possible without requiring centralized 
coordination or significant network infrastructure (as opposed to, e.g., hierarchical schemes [8]). In addition, they combine 
simplicity of implementation with robustness to node failures and are resilient to communication delays [9]. These qualities 
are important in clusters, which are typically shared among many users, and algorithms need to be immune to slow nodes 
that use part of their computation and communication resources for unrelated tasks. The main drawback of consensus-based 
optimization algorithms comes from the potentially high communication cost associated with distributed consensus. At the same 
time, existing convergence bounds in terms of iterations (e.g., (7) below) suggest that increasing the number of processors 
slows down convergence, which contradicts the intuition that more computing resources are better. 

This paper focuses on understanding the limitations and potential for scalability of consensus-based optimization. We build 
on the distributed dual averaging framework [4]. The key to our analysis is to attach to each iteration a cost that involves 
two competing terms: a computation cost per iteration which decreases as we add more processors, and a communication cost 
which depends on the network. Our cost expression quantifies the communication/computation tradeoff by a parameter r that 
is easy to estimate for a given problem and platform. The role of r is essential; for example, when nodes communicate at 
every iteration, we show that in complete graph topologies, there exists an optimal number of processors $n_{opt} = \frac{1}{\sqrt{r}}$, while 
for k-regular expander graphs [1], increasing the network size yields a diminishing speedup. Similar results are obtained when 
nodes communicate every h > 1 iterations and even when h increases with time. We validate our analysis with experiments 
on a cluster. Our results show a remarkable agreement between theory and practice. 

In Section II we formalize the distributed optimization problem and summarize the distributed dual averaging algorithm. 
Section III introduces the communication/computation tradeoff and contains the basic analysis where nodes communicate at 
every iteration. The general case of sparsifying communication is treated in Section IV. Section V tests our theoretical results 
on a real cluster implementation, and Section VI discusses some future extensions. 
II. Distributed Convex Optimization 

Assume we have at our disposal a cluster with n processors to solve (1), and suppose without loss of generality that m is 
divisible by n. In the absence of any other information, we partition the data evenly among the processors and our objective 
becomes to solve the optimization problem, 

$$\underset{x \in X}{\text{minimize}} \quad F(x) = \frac{1}{m}\sum_{j=1}^{m} l_j(x) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{n}{m}\sum_{j=1}^{m/n} l_{j|i}(x)\right) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) \tag{2}$$

where we use the notation $l_{j|i}$ to denote the loss associated with the jth local data point at processor i (i.e., $j|i = (i-1)\frac{m}{n} + j$). 
The local objective functions fi(x) at each node are assumed to be L-Lipschitz and convex. The recent distributed optimization 
literature contains multiple consensus-based algorithms with similar rates of convergence for solving this type of problem. We 
adopt the distributed dual averaging (DDA) framework [4] because its analysis admits a clear separation between the standard 
(centralized) optimization error and the error due to distributing computation over a network, facilitating our investigation of 
the communication/computation tradeoff. 

A. Distributed Dual Averaging (DDA) 

In DDA, nodes iteratively communicate and update optimization variables to solve (2). Nodes only communicate if they are 
neighbors in a communication graph G = (V, E), with the |V| = n vertices being the processors. The communication graph is 
user-defined (application layer) and does not necessarily correspond to the physical interconnections between processors. DDA 
requires three additional quantities: a 1-strongly convex proximal function $\psi : \mathbb{R}^d \to \mathbb{R}$ satisfying $\psi(x) \ge 0$ and $\psi(0) = 0$ 
(e.g., $\psi(x) = \frac{1}{2}x^T x$); a positive step size sequence $a(t) = O(\frac{1}{\sqrt{t}})$; and an $n \times n$ doubly stochastic consensus matrix P with 
entries $p_{ij} > 0$ only if either i = j or $(j, i) \in E$ and $p_{ij} = 0$ otherwise. The algorithm repeats for each node i in discrete 
steps t, the following updates: 

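$$z_i(t) = \sum_{j=1}^{n} p_{ij}\, z_j(t-1) + g_i(t-1) \tag{3}$$

$$x_i(t) = \underset{x \in X}{\operatorname{argmin}} \left\{ \langle z_i(t), x \rangle + \frac{1}{a(t-1)}\psi(x) \right\} \tag{4}$$

$$\hat{x}_i(t) = \frac{1}{t}\left((t-1)\,\hat{x}_i(t-1) + x_i(t)\right) \tag{5}$$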

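where $g_i(t-1) \in \partial f_i(x_i(t-1))$ is a subgradient of $f_i(x)$ evaluated at $x_i(t-1)$. In (3), the variable $z_i(t) \in \mathbb{R}^d$ maintains an 
accumulated subgradient up to time t and represents node i's belief of the direction of the optimum. To update $z_i(t)$ in (3), 
each node must communicate to exchange the variables $z_j(t)$ with its neighbors in G. If $\psi(x^*) \le R^2$, for the local running 
averages $\hat{x}_i(t)$ defined in (5), the error from a minimizer $x^*$ of F(x) after T iterations is bounded by (Theorem 1, [4]) 

$$\mathrm{Err}_i(T) = F(\hat{x}_i(T)) - F(x^*) \le \frac{R^2}{T a(T)} + \frac{L^2}{2T}\sum_{t=1}^{T} a(t-1) + \frac{L}{T}\sum_{t=1}^{T} a(t)\left[\frac{2}{n}\sum_{j=1}^{n}\left\|\bar{z}(t) - z_j(t)\right\|_* + \left\|\bar{z}(t) - z_i(t)\right\|_*\right] \tag{6}$$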

where L is the Lipschitz constant, $\|\cdot\|_*$ indicates the dual norm, $\bar{z}(t) = \frac{1}{n}\sum_{i=1}^{n} z_i(t)$, and $\|\bar{z}(t) - z_i(t)\|_*$ quantifies the 
network error as a disagreement between the direction to the optimum at node i and the consensus direction $\bar{z}(t)$ at time t. 
Furthermore, from Theorem 2 in [4], with $a(t) = \frac{A}{\sqrt{t}}$, after optimizing for A we have a bound on the error, 

$$\mathrm{Err}_i(T) \le C_1\,\frac{\log(T\sqrt{n})}{\sqrt{T}}, \qquad C_1 = 2LR\sqrt{19 + \frac{12}{1-\sqrt{\lambda_2}}}, \tag{7}$$



where $\lambda_2$ is the second largest eigenvalue of P. The dependence on the communication topology is reflected through $\lambda_2$, since 
the sparsity structure of P is determined by G. According to (7), increasing n slows down the rate of convergence even if $\lambda_2$ 
does not depend on n. 
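
To make the update rules concrete, the following is a minimal sketch of DDA, not the authors' implementation. It assumes an unconstrained domain $X = \mathbb{R}^d$, the proximal function $\psi(x) = \frac{1}{2}x^T x$ (so the argmin step has the closed form $x_i(t) = -a(t-1) z_i(t)$), a complete graph with uniform weights, and hypothetical local objectives $f_i(x) = \|x - c_i\|_1$. The function names and the convention $a(0) := a(1)$ are illustrative choices.

```python
import numpy as np

def dda_complete_graph(centers, T, A=1.0):
    """Run T iterations of updates (3)-(5); return the running averages x_hat_i(T)."""
    n, d = centers.shape
    P = np.full((n, n), 1.0 / n)            # complete graph, uniform weights (doubly stochastic)
    z = np.zeros((n, d))                    # z_i(0) = 0
    x = np.zeros((n, d))                    # x_i(0) = argmin psi(x) = 0
    x_hat = np.zeros((n, d))
    for t in range(1, T + 1):
        g = np.sign(x - centers)            # subgradient of f_i(x) = ||x - c_i||_1 at x_i(t-1)
        z = P @ z + g                       # (3): consensus step plus local subgradient
        step = A / np.sqrt(max(t - 1, 1))   # a(t-1) = A/sqrt(t-1); a(0) := a(1) is an assumption
        x = -step * z                       # (4) for psi(x) = x^T x / 2 and X = R^d
        x_hat = ((t - 1) * x_hat + x) / t   # (5): running average
    return x_hat

def objective(x, centers):
    """F(x) = (1/n) * sum_i ||x - c_i||_1."""
    return float(np.mean(np.sum(np.abs(x - centers), axis=1)))

centers = np.array([[-1.0], [0.0], [0.5], [3.0]])
x_hat = dda_complete_graph(centers, T=10000)
f_star = objective(np.array([0.0]), centers)   # any point in [0, 0.5] minimizes F
errors = [objective(x_hat[i], centers) - f_star for i in range(len(centers))]
```

Each entry of `errors` plays the role of $\mathrm{Err}_i(T)$ for one node; on this toy problem all nodes should end up close to the same minimizer.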

III. Communication/Computation Tradeoff 

In consensus-based distributed optimization algorithms such as DDA, the communication graph G and the cost of transmitting 
a message have an important influence on convergence speed, especially when communicating one message requires a non- 
trivial amount of time (e.g., if the dimension d of the problem is very high). 
We are interested in the shortest time to obtain an $\epsilon$-accurate solution (i.e., $\mathrm{Err}_i(T) \le \epsilon$). From (7), convergence is faster 
for topologies with good expansion properties; i.e., when the spectral gap $1 - \sqrt{\lambda_2}$ does not shrink too quickly as n grows. 
In addition, it is preferable to have a balanced network, where each node has the same number of neighbors so that all nodes 
spend roughly the same amount of time communicating per iteration. Below we focus on two particular cases and take G to 
be either a complete graph (i.e., all pairs of nodes communicate) or a k-regular expander [1]. 

By using more processors, the total amount of communication inevitably increases. At the same time, more data can be 
processed in parallel in the same amount of time. We focus on the scenario where the size m of the dataset is fixed but possibly 
very large. To understand whether there is room for speedup, we move away from measuring iterations and employ a time 
model that explicitly accounts for communication cost. This will allow us to study the communication/computation tradeoff 
and draw conclusions based on the total amount of time to reach an $\epsilon$-accurate solution. 



A. Time model 

At each iteration, in step (3), processor i computes a local subgradient on its subset of the data: 

$$g_i(t) = \frac{\partial f_i(x)}{\partial x} = \frac{n}{m}\sum_{j=1}^{m/n}\frac{\partial l_{j|i}(x)}{\partial x}. \tag{8}$$

The cost of this computation increases linearly with the subset size. Let us normalize time so that one processor computes a 
subgradient on the full dataset of size m in 1 time unit. Then, using n CPUs, each local gradient will take $\frac{1}{n}$ time units to 
compute. We ignore the time required to compute the projection in step (4); often this can be done very efficiently and requires 
negligible time when m is large compared to n and d. 

We account for the cost of communication as follows. In the consensus update (3), each pair of neighbors in G transmits 
and receives one variable $z_j(t-1)$. Since the message size depends only on the problem dimension d and does not change 
with m or n, we denote by r the time required to transmit and receive one message, relative to the 1 time unit required to 
compute the full gradient on all the data. If every node has k neighbors, the cost of one iteration in a network of n nodes is 

$$\frac{1}{n} + kr \quad \text{time units / iteration.} \tag{9}$$

Using this time model, we study the convergence rate bound (7) after attaching an appropriate time unit cost per iteration. To 
obtain a speedup by increasing the number of processors n for a given problem, we must ensure that $\epsilon$-accuracy is achieved 
in fewer time units. 
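
The time model can be written down directly. The sketch below is illustrative only: the function names and the example values (8 nodes, r = 0.01, a hypothetical 3-regular graph) are assumptions rather than numbers from the experiments.

```python
def iteration_cost(n, k, r):
    """Time units per iteration: 1/n for the local subgradient plus k messages of cost r."""
    return 1.0 / n + k * r

def time_for_iterations(T, n, k, r):
    """Total time units spent on T iterations when nodes communicate at every iteration."""
    return T * iteration_cost(n, k, r)

# Same number of nodes, different topologies (illustrative values).
complete = iteration_cost(n=8, k=7, r=0.01)   # complete graph: k = n - 1
sparse = iteration_cost(n=8, k=3, r=0.01)     # a 3-regular graph
```

With these values the computation term is the same for both topologies, so the difference in per-iteration cost comes entirely from the number of neighbors k.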



B. Simple Case: Communicate at every Iteration 

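In the original DDA description (3)-(5), nodes communicate at every iteration. According to our time model, T iterations 
will cost $\tau = T(\frac{1}{n} + kr)$ time units. From (7), the time $\tau(\epsilon)$ to reach error $\epsilon$ is found by substituting for T and solving for 
$\tau(\epsilon)$. Ignoring the log factor in (7), we get 

$$C_1\sqrt{\frac{\frac{1}{n} + kr}{\tau(\epsilon)}} = \epsilon \implies \tau(\epsilon) = \frac{C_1^2}{\epsilon^2}\left(\frac{1}{n} + kr\right) \quad \text{time units.} \tag{10}$$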



This simple manipulation reveals some important facts. If communication is free, then r = 0. If in addition the network G 
is a k-regular expander, then $\lambda_2$ is fixed [10], $C_1$ is independent of n and $\tau(\epsilon) = C_1^2/(\epsilon^2 n)$. Thus, in the ideal situation, we 
obtain a linear speedup by increasing the number of processors, as one would expect. In reality, of course, communication is 
not free. 

Complete graph. Suppose that G is the complete graph, where k = n - 1 and $\lambda_2 = 0$. In this scenario we cannot keep 
increasing the network size without eventually harming performance due to the excessive communication cost. For a problem 
with a communication/computation tradeoff r, the optimal number of processors is calculated by minimizing $\tau(\epsilon)$ for n: 

$$\frac{\partial \tau(\epsilon)}{\partial n} = 0 \implies n_{opt} = \frac{1}{\sqrt{r}}. \tag{11}$$



Again, in accordance with intuition, if the communication cost is too high (i.e., r > 1) and it takes more time to transmit and 
receive a gradient than it takes to compute it, using a complete graph cannot speed up the optimization. We reiterate that r is a 
quantity that can be easily measured for a given hardware and a given optimization problem. As we report in Section V, the 
optimal value predicted by our theory agrees very well with experimental performance on a real cluster. 
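
As a quick numerical check of (11), the sketch below evaluates the complete-graph time (10) with k = n - 1 and compares the closed-form optimum with a brute-force search over integer network sizes. The value r = 0.01 and the helper names are illustrative assumptions; since $\lambda_2 = 0$ on a complete graph, $C_1$ does not depend on n and is set to 1.

```python
import math

def tau_complete(n, r, c1=1.0, eps=1.0):
    """Time to eps-accuracy from (10) on a complete graph (k = n - 1, lambda_2 = 0)."""
    return (c1 / eps) ** 2 * (1.0 / n + (n - 1) * r)

def n_opt(r):
    """Closed-form minimizer (11) of tau_complete over n."""
    return 1.0 / math.sqrt(r)

r = 0.01
best_integer_n = min(range(1, 51), key=lambda n: tau_complete(n, r))
```

For this r the continuous optimum is 10 and the integer search agrees; for other values of r the integer optimum is one of the two integers adjacent to $1/\sqrt{r}$.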

Expander. For the case where G is a k-regular expander, the communication cost per node remains constant as n increases. 
From (10) and the expression for $C_1$ in (7), we see that n can be increased without losing performance, although the benefit 
diminishes (relative to kr) as n grows. 
IV. General Case: Sparse Communication 

The previous section analyzes the case where processors communicate at every iteration. Next we investigate the more 
general situation where we adjust the frequency of communication. 



A. Bounded Intercommunication Intervals 

Suppose that a consensus step takes place once every h + 1 iterations. That is, the algorithm repeats $h \ge 1$ cheap iterations 
(no communication) of cost $\frac{1}{n}$ time units followed by an expensive iteration (with communication) with cost $\frac{1}{n} + kr$. This 
strategy clearly reduces the overall average cost per iteration. The caveat is that the network error $\|\bar{z}(t) - z_i(t)\|_*$ is higher 
because of having executed fewer consensus steps. 

In a cheap iteration we replace the update (3) by $z_i(t) = z_i(t-1) + g_i(t-1)$. After some straightforward algebra we can 
show that [for (12), (16) please consult the supplementary material]: 

$$z_i(t) = \sum_{w=0}^{H_t-1}\sum_{k=0}^{h-1}\sum_{j=1}^{n}\left[P^{H_t-w}\right]_{ij} g_j(wh+k) + \sum_{k=0}^{Q_t-1} g_i(t-Q_t+k) \tag{12}$$

where $H_t = \lfloor\frac{t-1}{h}\rfloor$ counts the number of communication steps in t iterations, and $Q_t = \mathrm{mod}(t, h)$ if $\mathrm{mod}(t, h) > 0$ and 
$Q_t = h$ otherwise. Using the fact that $P\mathbf{1} = \mathbf{1}$, we obtain 

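$$\begin{aligned}
\bar{z}(t) - z_i(t) = \frac{1}{n}\sum_{s=1}^{n} z_s(t) - z_i(t) &= \sum_{w=0}^{H_t-1}\sum_{j=1}^{n}\left(\frac{1}{n} - \left[P^{H_t-w}\right]_{ij}\right)\sum_{k=0}^{h-1} g_j(wh+k) \qquad (13)\\
&\quad + \frac{1}{n}\sum_{s=1}^{n}\sum_{k=0}^{Q_t-1}\left(g_s(t-Q_t+k) - g_i(t-Q_t+k)\right). \qquad (14)
\end{aligned}$$

Taking norms, recalling that the $f_i$ are convex and Lipschitz, and since $Q_t \le h$, we arrive at 

$$\left\|\bar{z}(t) - z_i(t)\right\|_* \le \sum_{w=0}^{H_t-1}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 hL + 2hL. \tag{15}$$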



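Using a technique similar to that in [4] to bound the $\ell_1$ distance of row i of $P^{H_t-w}$ to its stationary distribution as t grows, 
we can show that 

$$\left\|\bar{z}(t) - z_i(t)\right\|_* \le 2hL\,\frac{\log(T\sqrt{n})}{1-\sqrt{\lambda_2}} + 3hL \tag{16}$$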

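for all $t \le T$. Comparing (16) to equation (29) in [4], the network error within t iterations is no more than h times larger when 
a consensus step is only performed once every h + 1 iterations. Finally, we substitute the network error in (6). For $a(t) = \frac{A}{\sqrt{t}}$ 
we have $\sum_{t=1}^{T} a(t) \le 2A\sqrt{T}$, and 

$$\mathrm{Err}_i(T) \le \left(\frac{R^2}{A} + AL^2\left(1 + 18h + \frac{12h}{1-\sqrt{\lambda_2}}\right)\right)\frac{\log(T\sqrt{n})}{\sqrt{T}} = C_h\,\frac{\log(T\sqrt{n})}{\sqrt{T}}. \tag{17}$$

We minimize the leading term $C_h$ over A to obtain 

$$A = \frac{R}{L}\,\frac{1}{\sqrt{1 + 18h + \frac{12h}{1-\sqrt{\lambda_2}}}} \quad \text{and} \quad C_h = 2RL\sqrt{1 + 18h + \frac{12h}{1-\sqrt{\lambda_2}}}. \tag{18}$$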



Of the T iterations, only $H_T = \lfloor\frac{T-1}{h}\rfloor$ involve communication. So, T iterations will take 

$$\tau = (T - H_T)\frac{1}{n} + H_T\left(\frac{1}{n} + kr\right) = \frac{T}{n} + H_T kr \quad \text{time units.} \tag{19}$$



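To achieve $\epsilon$-accuracy, ignoring again the logarithmic factor, we need $T = \frac{C_h^2}{\epsilon^2}$ iterations, or 

$$\tau(\epsilon) = \frac{T}{n} + \left\lfloor\frac{T-1}{h}\right\rfloor kr \le \frac{C_h^2}{\epsilon^2}\left(\frac{1}{n} + \frac{kr}{h}\right) \quad \text{time units.} \tag{20}$$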



From the last expression, for a fixed number of processors n, there exists an optimal value for h that depends on the network 
size and communication graph G: 



$$h_{opt} = \sqrt{\frac{nkr}{18 + \frac{12}{1-\sqrt{\lambda_2}}}}. \tag{21}$$



If the network is a complete graph, using $h_{opt}$ yields $\tau(\epsilon) = \Theta(n)$; i.e., using more processors hurts performance when not 
communicating every iteration. On the other hand, if the network is a k-regular expander then $\tau(\epsilon) = \frac{c_1}{\sqrt{n}} + c_2$ for constants 
$c_1, c_2$, and we obtain a diminishing speedup. 
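
The optimal interval can be checked numerically. The sketch below codes the upper bound (20) with $C_h$ from (18) and the stationary point (21), treating h as a continuous variable; the helper names and the second parameter set (n = 10, k = 3, r = 5, $\lambda_2 = 0.25$) are illustrative assumptions.

```python
import math

def c_h_squared(h, lam2, L=1.0, R=1.0):
    """C_h^2 from (18)."""
    return (2 * R * L) ** 2 * (1 + 18 * h + 12 * h / (1 - math.sqrt(lam2)))

def tau_sparse(h, n, k, r, lam2, eps=1.0):
    """Upper bound (20) on the time to eps-accuracy."""
    return c_h_squared(h, lam2) / eps ** 2 * (1.0 / n + k * r / h)

def h_opt(n, k, r, lam2):
    """Stationary point (21) of tau_sparse in h (h treated as continuous)."""
    return math.sqrt(n * k * r / (18 + 12 / (1 - math.sqrt(lam2))))

# Cheap communication on a complete graph of 10 nodes: the formula gives a value below 1,
# so in practice one would communicate at every iteration (h = 1).
h_cheap = h_opt(n=10, k=9, r=0.00089, lam2=0.0)

# Expensive communication on a sparse graph (illustrative values): skipping rounds pays off.
h_costly = h_opt(n=10, k=3, r=5.0, lam2=0.25)
```

Because the bound is convex in h, evaluating `tau_sparse` slightly to either side of `h_costly` gives larger values; an integer interval would then be chosen by comparing the two neighboring integers.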
B. Increasingly Sparse Communication 

Next, we consider progressively increasing the intercommunication intervals. This captures the intuition that as the optimization 
moves closer to the solution, progress slows down and a processor should have "something significantly new to say" 
before it communicates. Let $h_j - 1$ denote the number of cheap iterations performed between the (j - 1)st and jth expensive 
iteration; i.e., the first communication is at iteration $h_1$, the second at iteration $h_1 + h_2$, and so on. We consider schemes 
where $h_j = j^p$ for $p \ge 0$. The number of iterations that nodes communicate out of the first T total iterations is given by 
$H_T = \max\{H : \sum_{j=1}^{H} h_j \le T\}$. We have 

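$$\int_{y=1}^{H_T} y^p\,dy \le \sum_{j=1}^{H_T} j^p \le T < \sum_{j=1}^{H_T+1} j^p \le \int_{y=1}^{H_T+2} y^p\,dy \implies \frac{H_T^{p+1} - 1}{p+1} \le T < \frac{(H_T+2)^{p+1} - 1}{p+1} \tag{22}$$

which means that $H_T = \Theta(T^{\frac{1}{p+1}})$ as $T \to \infty$. Similar to (15), the network error is bounded as 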



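$$\left\|\bar{z}(t) - z_i(t)\right\|_* \le \sum_{w=0}^{H_t-1}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1\sum_{k=0}^{h_w-1} L + 2h_tL = L\sum_{w=0}^{H_t-1}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 h_w + 2h_tL. \tag{23}$$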



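We split the sum into two terms based on whether or not the powers of P have converged. Using the split point $\hat{t} = \frac{\log(T\sqrt{n})}{1-\sqrt{\lambda_2}}$, 
the $\ell_1$ term is bounded by 2 when w is large and by $\frac{1}{T}$ when w is small: 

$$\begin{aligned}
\left\|\bar{z}(t) - z_i(t)\right\|_* &\le L\sum_{w=0}^{H_t-1-\hat{t}}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 h_w + L\sum_{w=H_t-\hat{t}}^{H_t-1}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 h_w + 2h_tL \qquad (24)\\
&\le \frac{L}{T}\sum_{w=0}^{H_t-1-\hat{t}} w^p + 2L\sum_{w=H_t-\hat{t}}^{H_t-1} w^p + 2t^pL \qquad (25)
\end{aligned}$$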

L (H t -t-l)^ +P + + ^ 

1 p + 1 



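since $T > H_t - \hat{t} - 1$. Substituting this bound into (6) and taking the step size sequence to be $a(t) = \frac{A}{t^q}$ with A and q to be 
determined, we get 

$$\mathrm{Err}_i(T) \le \frac{R^2}{A T^{1-q}} + \frac{L^2 A}{2(1-q)T^q} + \frac{3L^2 A}{(p+1)(1-q)T^q} + \frac{3L^2 p A}{(p+1)(1-q)T^{1+q}} + \frac{6L^2\hat{t}A}{T}\sum_{t=1}^{T}\frac{H_t^p}{t^q} + \frac{6L^2 A}{T}\sum_{t=1}^{T} t^{p-q}.$$

The first four summands converge to zero when $0 < q < 1$. Since $H_t = \Theta(t^{\frac{1}{p+1}})$, 

$$\frac{1}{T}\sum_{t=1}^{T}\frac{H_t^p}{t^q} = \frac{1}{T}\sum_{t=1}^{T} O\left(t^{\frac{p}{p+1}-q}\right) = O\left(T^{\frac{p}{p+1}-q}\right)$$

which converges to zero if $\frac{p}{p+1} < q$. To bound the last term, note that $\frac{1}{T}\sum_{t=1}^{T} t^{p-q} \le \frac{T^{p-q}}{p-q+1}$, so the term goes to zero as 
$T \to \infty$ if $p < q$. In conclusion, $\mathrm{Err}_i(T)$ converges no slower than $O\left(\frac{\log(T\sqrt{n})}{T^{q-p}}\right)$ since $\frac{1}{T^{q-\frac{p}{p+1}}} < \frac{1}{T^{q-p}}$. If we choose $q = \frac{1}{2}$ 
to balance the first three summands, for small $p > 0$, the rate of convergence is arbitrarily close to $O\left(\frac{\log(T\sqrt{n})}{\sqrt{T}}\right)$, while nodes 
communicate increasingly infrequently as $T \to \infty$. 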
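
The growth of the number of communication rounds under the schedule $h_j = j^p$ can be explored with a short sketch. It takes the definition $H_T = \max\{H : \sum_{j=1}^{H} j^p \le T\}$ literally, with real-valued intervals (an implementation would have to round them, which is not modeled here), and checks one valid pair of integral bounds, $\frac{H_T^{p+1}-1}{p+1} \le T < \frac{(H_T+2)^{p+1}-1}{p+1}$. The function name is hypothetical.

```python
def num_communications(T, p):
    """H_T = max{H : sum_{j=1}^H j^p <= T} for intervals h_j = j^p."""
    total, H = 0.0, 0
    while total + (H + 1) ** p <= T:
        H += 1
        total += H ** p
    return H

def within_integral_bounds(T, p):
    """Check (H^(p+1) - 1)/(p+1) <= T < ((H+2)^(p+1) - 1)/(p+1) for H = H_T."""
    H = num_communications(T, p)
    lower = (H ** (p + 1) - 1) / (p + 1)
    upper = ((H + 2) ** (p + 1) - 1) / (p + 1)
    return lower <= T < upper

# p = 0 recovers communication at every iteration; p = 1 gives roughly sqrt(2T) rounds.
rounds_p0 = num_communications(100, 0)
rounds_p1 = num_communications(100, 1)
```

Larger p means fewer rounds for the same T, which is the sense in which communication becomes increasingly sparse.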

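Out of T total iterations, DDA executes $H_T = \Theta(T^{\frac{1}{p+1}})$ expensive iterations involving communication and $T - H_T$ cheap 
iterations without communication, so 

$$\tau(\epsilon) = \Theta\left(\frac{T}{n} + T^{\frac{1}{p+1}} kr\right) = \Theta\left(T\left(\frac{1}{n} + \frac{kr}{T^{\frac{p}{p+1}}}\right)\right). \tag{30}$$

In this case, the communication cost kr becomes a less and less significant proportion of $\tau(\epsilon)$ as T increases. So for any 
$0 < p < \frac{1}{2}$, if k is fixed, we approach a linear speedup behaviour $\Theta(\frac{T}{n})$. To get $\mathrm{Err}_i(T) \le \epsilon$, ignoring the logarithmic factor, 
we need 

$$T = \left(\frac{C_p}{\epsilon}\right)^{\frac{2}{1-2p}} \text{ iterations, with } C_p = 2LR\sqrt{7 + \frac{12p+12}{(3p+1)(1-\sqrt{\lambda_2})} + \frac{12}{2p+1}}. \tag{31}$$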
From this last equation we see that for $0 < p < \frac{1}{2}$ we have $C_p < C_1$, so using increasingly sparse communication should, in 
fact, be faster than communicating at every iteration. 
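
The comparison of constants can be verified numerically. The sketch below codes $C_1$ as $2LR\sqrt{19 + 12/(1-\sqrt{\lambda_2})}$ and $C_p$ as $2LR\sqrt{7 + \frac{12p+12}{(3p+1)(1-\sqrt{\lambda_2})} + \frac{12}{2p+1}}$; the function names and the parameter values (L = R = 1, $\lambda_2 = 0.25$) are illustrative assumptions.

```python
import math

def c_1(L, R, lam2):
    """Constant for communication at every iteration."""
    return 2 * L * R * math.sqrt(19 + 12 / (1 - math.sqrt(lam2)))

def c_p(p, L, R, lam2):
    """Constant for intervals h_j = j^p."""
    gap = 1 - math.sqrt(lam2)
    return 2 * L * R * math.sqrt(7 + (12 * p + 12) / ((3 * p + 1) * gap) + 12 / (2 * p + 1))

def iterations_needed(p, eps, L, R, lam2):
    """T = (C_p / eps)^(2 / (1 - 2p)); only meaningful for 0 <= p < 1/2."""
    return (c_p(p, L, R, lam2) / eps) ** (2 / (1 - 2 * p))

ratio = c_p(0.3, 1.0, 1.0, 0.25) / c_1(1.0, 1.0, 0.25)
```

Setting p = 0 recovers $C_1$ exactly. Note that this snippet only compares the constants: the exponent $\frac{2}{1-2p}$ also grows with p, so the iteration count alone does not decide which schedule is faster in time units.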

V. Experimental Evaluation 

To verify our theoretical findings, we implement DDA on a cluster of 14 nodes with 3.2 GHz Pentium 4HT processors and 
1 GB of memory each, connected via Ethernet that allows for roughly 11 MB/sec throughput per node. Our implementation is 
in C++ using the send and receive functions of OpenMPI v1.4.4 for communication. The Armadillo v2.3.91 library, linked to 
LAPACK and BLAS, is used for efficient numerical computations. 

A. Application to Metric Learning 

Metric learning [11], [12], [13] is a computationally intensive problem where the goal is to find a distance metric D(u, v) 
such that points that are related have a very small distance under D while for unrelated points D is large. Following the 
formulation in [14], we have a data set $\{u_j, v_j, s_j\}_{j=1}^{m}$ with $u_j, v_j \in \mathbb{R}^d$ and $s_j \in \{-1, 1\}$ signifying whether or not $u_j$ 
is similar to $v_j$ (e.g., similar if they are from the same class). Our goal is to find a symmetric positive semi-definite matrix 
$A \succeq 0$ to define a pseudo-metric of the form $D_A(u, v) = \sqrt{(u-v)^T A (u-v)}$. To that end, we use a hinge-type loss function 
$l_j(A, b) = \max\{0, s_j(D_A(u_j, v_j)^2 - b) + 1\}$ where $b \ge 1$ is a threshold that determines whether two points are dissimilar 
according to $D_A(\cdot, \cdot)$. In the batch setting, we formulate the convex optimization problem 

$$\underset{A,\,b}{\text{minimize}} \quad F(A, b) = \sum_{j=1}^{m} l_j(A, b) \quad \text{subject to } A \succeq 0,\ b \ge 1. \tag{32}$$

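The subgradient of $l_j$ at (A, b) is zero if $s_j(D_A(u_j, v_j)^2 - b) \le -1$. Otherwise 

$$\frac{\partial l_j(A, b)}{\partial A} = s_j (u_j - v_j)(u_j - v_j)^T, \quad \text{and} \quad \frac{\partial l_j(A, b)}{\partial b} = -s_j. \tag{33}$$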
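
For a single training pair, the hinge-type loss and its subgradient can be sketched as follows. This is an illustrative NumPy version, not the C++/Armadillo code used in the experiments, and the function name is hypothetical.

```python
import numpy as np

def hinge_loss_and_subgradient(A, b, u, v, s):
    """Return (loss, dloss/dA, dloss/db) for one pair (u, v) with label s in {-1, +1}."""
    diff = u - v
    dist_sq = float(diff @ A @ diff)          # D_A(u, v)^2
    margin = s * (dist_sq - b) + 1.0
    if margin <= 0.0:                         # s (D_A^2 - b) <= -1: the hinge is inactive
        return 0.0, np.zeros_like(A), 0.0
    return margin, s * np.outer(diff, diff), -float(s)

A = np.eye(2)
# A pair with s = +1 at squared distance 1 and threshold b = 1: the hinge is active.
loss_a, grad_A_a, grad_b_a = hinge_loss_and_subgradient(
    A, 1.0, np.array([1.0, 0.0]), np.array([0.0, 0.0]), 1)
# A pair with s = -1 at squared distance 9: the hinge is inactive, so everything is zero.
loss_b, grad_A_b, grad_b_b = hinge_loss_and_subgradient(
    A, 1.0, np.array([3.0, 0.0]), np.array([0.0, 0.0]), -1)
```

The subgradient with respect to A is a rank-one outer product, so the local subgradient of a node is a sum of such matrices over its share of the pairs.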

Since DDA uses vectors $x_i(t)$ and $z_i(t)$, we represent each pair $(A_i(t), b_i(t))$ as a $d^2 + 1$ dimensional vector. The communication 
cost is thus quadratic in the dimension. In step (4) of DDA, we use the proximal function $\psi(x) = \frac{1}{2}x^T x$, in which case (4) 
simplifies to taking $x_i(t) = -a(t-1) z_i(t)$, followed by projecting $x_i(t)$ to the constraint set by setting $b_i(t) \leftarrow \max\{1, b_i(t)\}$ 
and projecting $A_i(t)$ to the set of positive semi-definite matrices by first taking its eigenvalue decomposition and reconstructing 
$A_i(t)$ after forcing any negative eigenvalues to zero. 
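
The projection step described above can be sketched in a few lines of NumPy. The explicit symmetrization is an added safeguard against round-off and is an assumption, as is the function name.

```python
import numpy as np

def project(A, b):
    """Project (A, b) onto {A positive semi-definite, b >= 1}."""
    A_sym = (A + A.T) / 2.0                      # guard against round-off asymmetry (assumption)
    w, V = np.linalg.eigh(A_sym)                 # eigenvalue decomposition of a symmetric matrix
    A_psd = (V * np.clip(w, 0.0, None)) @ V.T    # rebuild with negative eigenvalues set to zero
    return A_psd, max(1.0, b)

A_proj, b_proj = project(np.array([[1.0, 0.0], [0.0, -2.0]]), 0.5)
```

A matrix that is already positive semi-definite is returned unchanged up to floating-point error, and a threshold that is already at least 1 is left as is.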

We use the MNIST digits dataset which consists of 28 x 28 pixel images of handwritten digits 0 through 9. Representing 
images as vectors, we have $d = 28^2 = 784$ and a problem with $d^2 + 1 = 614657$ dimensions trying to learn a 784 x 784 matrix 
A. With double precision arithmetic, each DDA message has a size of approximately 4.7 MB. We construct a dataset by randomly 
selecting 5000 pairs from the full MNIST data. One node needs 29 seconds to compute a gradient on this dataset, and sending 
and receiving 4.7 MB takes 0.85 seconds. The communication/computation tradeoff value is estimated as $r = \frac{0.85}{29} \approx 0.0293$. 
According to (11), when G is a complete graph, we expect to have optimal performance when using $n_{opt} = \frac{1}{\sqrt{r}} = 5.8$ nodes. 
Figure 1 (left) shows the evolution of the average function value $\bar{F}(t) = \frac{1}{n}\sum_{i=1}^{n} F(\hat{x}_i(t))$ for 1 to 14 processors connected as 
a complete graph, where $\hat{x}_i(t)$ is as defined in (5). There is a very good match between theory and practice since the fastest 
convergence is achieved with n = 6 nodes. 

In the second experiment, to make r closer to 0, we apply PCA to the original data and keep the top 87 principal components, 
containing 90% of the energy. The dimension of the problem is reduced dramatically to $87 \cdot 87 + 1 = 7570$ and the message 
size to 59 KB. Using 60000 random pairs of MNIST data, the time to compute one gradient on the entire dataset with one 
node is 2.1 seconds, while the time to transmit and receive 59 KB is only 0.0104 seconds. Again, for a complete graph, Figure 
1 (right) illustrates the evolution of $\bar{F}(t)$ for 1 to 14 nodes. As we see, increasing n speeds up the computation. The speedup 
we get is close to linear at first, but diminishes since communication is not entirely free. In this case $r = \frac{0.0104}{2.1} = 0.005$ and 
$n_{opt} = 14.15$. 
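
The quantities reported for the two MNIST experiments can be reproduced from the stated timings. The sketch below uses hypothetical helper names and assumes 8 bytes per double-precision value; the timings are the ones quoted in the text.

```python
import math

def message_bytes(d, bytes_per_value=8):
    """Size of one DDA message: a d x d matrix plus the threshold b, in double precision."""
    return (d * d + 1) * bytes_per_value

def tradeoff(t_message, t_gradient):
    """r: time to send and receive one message relative to one full-data gradient."""
    return t_message / t_gradient

def best_n(r, n_max):
    """Integer n in 1..n_max minimizing the complete-graph cost 1/n + (n - 1) r."""
    return min(range(1, n_max + 1), key=lambda n: 1.0 / n + (n - 1) * r)

r_full = tradeoff(0.85, 29.0)      # full MNIST, d = 784
r_pca = tradeoff(0.0104, 2.1)      # PCA-reduced MNIST, d = 87
n_full = best_n(r_full, 14)
n_pca = best_n(r_pca, 14)
```

With these timings, the integer search over the 14 available nodes selects 6 nodes for the full data and all 14 nodes for the PCA-reduced data.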

B. Nonsmooth Convex Minimization 

Next we create an artificial problem where the minima of the components fi(x) at each node are very different, so that 
communication is essential in order to obtain an accurate optimizer of F(x). We define fi(x) as a sum of high dimensional 
quadratics, 

M 

where $x \in \mathbb{R}^{10,000}$, M = 15,000 and $c^1_{j|i}$, $c^2_{j|i}$ are the centers of the quadratics. Figure 2 illustrates again the average function 
value $\bar{F}(t)$ for 10 nodes in a complete graph topology. The baseline performance is when nodes communicate at every iteration 
(h = 1). For this problem r = 0.00089 and, from (21), $h_{opt} = 1$. Naturally communicating every 2 iterations (h = 2) slows 
down convergence. Over the duration of the experiment, with h = 2, each node communicates with its peers 55 times. We 
selected p = 0.3 for increasingly sparse communication, and got $H_T = 53$ communications per node. As we see, even though 
nodes communicate as much as in the h = 2 case, convergence is even faster than communicating at every iteration. This verifies 
our intuition that communication is more important in the beginning. Finally, the case where p = 1 is shown. This value is 
out of the permissible range, and as expected DDA does not converge to the right solution. 

Fig. 1. (Left) In a subset of the full MNIST data for our specific hardware, $n_{opt} = \frac{1}{\sqrt{r}} = 5.8$. The fastest convergence is achieved on a complete graph 
of 6 nodes. (Right) In the reduced MNIST data using PCA, the communication cost drops and a speedup is achieved by scaling up to 14 processors. 



Fig. 2. Sparsifying communication to minimize (34) with 10 nodes in a complete graph topology. When waiting $t^{0.3}$ iterations between consensus steps, 
convergence is faster than communicating at every iteration (h = 1), even though the total number of consensus steps performed over the duration of the 
experiment is equal to communicating every 2 iterations (h = 2). When waiting a linear number of iterations between consensus steps (h = t) DDA does 
not converge to the right solution. Note: all methods are initialized from the same value; the x-axis (time in seconds) starts at 5 sec. 



VI. Conclusions and Future Work 

The analysis and experimental evaluation in this paper focus on distributed dual averaging and reveal its capability 
to scale with the network size. We expect that similar results hold for other consensus-based 
algorithms such as [?] as well as various distributed averaging-type algorithms (e.g., [15], [16], [17]). In the future we will 
extend the analysis to the case of stochastic optimization, where $h_t = t^p$ could correspond to using increasingly larger mini-batches. 
VII. Appendix 

A. Proof of equation (12) 

Let us stack the local node variables in a vector $z = [z_1 \cdots z_n]^T$ and $g = [g_1 \cdots g_n]^T$. From (3) in matrix form we have, after 
back-substituting in the recursion, 

$$z(h+1) = Pz(h) + g(h) = P\sum_{k=0}^{h-1} g(k) + g(h) \tag{35}$$

and after some algebra 

$$z(sh+1) = \sum_{w=1}^{s}\sum_{k=0}^{h-1} P^{w} g((s-w)h+k) + g(sh) \tag{36}$$

or in general 

$$\begin{aligned}
z(t) &= \sum_{w=1}^{H_t}\sum_{k=0}^{h-1} P^{w} g((H_t-w)h+k) + \sum_{k=0}^{Q_t-1} g(t-Q_t+k) \qquad (37)\\
&= \sum_{w=1}^{H_t}\sum_{k=0}^{h-1} P^{H_t-w+1} g((w-1)h+k) + \sum_{k=0}^{Q_t-1} g(t-Q_t+k) \qquad (38)\\
&= \sum_{w=0}^{H_t-1}\sum_{k=0}^{h-1} P^{H_t-w} g(wh+k) + \sum_{k=0}^{Q_t-1} g(t-Q_t+k) \qquad (39)
\end{aligned}$$

where $H_t = \lfloor\frac{t-1}{h}\rfloor$ counts the number of communication steps in t iterations and $Q_t = \mathrm{mod}(t, h)$ if $\mathrm{mod}(t, h) > 0$ and $Q_t = h$ otherwise. 
From this last expression we take the i-th row to get the result. 
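
The closed form can be checked against the recursion numerically. The sketch below assumes $z(0) = 0$, takes $H_t = \lfloor (t-1)/h \rfloor$, and treats the update that produces $z(t)$ as a consensus step exactly when $t = sh + 1$ for some $s \ge 1$, which is the convention implied by $z(h+1) = Pz(h) + g(h)$. The lazy ring matrix and the random stand-ins for the stacked subgradients are illustrative choices.

```python
import numpy as np

def lazy_ring(n):
    """Symmetric doubly stochastic matrix for a ring of n >= 3 nodes (illustrative choice)."""
    I = np.eye(n)
    return 0.5 * I + 0.25 * (np.roll(I, 1, axis=0) + np.roll(I, -1, axis=0))

def recursion(P, g, h, t):
    """z(t) from the update rules, with a consensus step whenever s - 1 is a positive multiple of h."""
    z = np.zeros_like(g[0])
    for s in range(1, t + 1):
        if s > 1 and (s - 1) % h == 0:
            z = P @ z + g[s - 1]      # expensive iteration: consensus update
        else:
            z = z + g[s - 1]          # cheap iteration: no communication
    return z

def closed_form(P, g, h, t):
    """z(t) from the closed-form double sum plus the trailing local terms."""
    H = (t - 1) // h
    Q = t % h or h
    z = np.zeros_like(g[0])
    for w in range(H):
        Pw = np.linalg.matrix_power(P, H - w)
        for k in range(h):
            z += Pw @ g[w * h + k]
    for k in range(Q):
        z += g[t - Q + k]
    return z

P = lazy_ring(5)
g = np.random.default_rng(0).standard_normal((20, 5, 2))   # g[s]: stacked subgradients at step s
match = all(np.allclose(recursion(P, g, h, t), closed_form(P, g, h, t))
            for h in (1, 2, 3) for t in range(1, 21))
```

Under these conventions the two computations agree for every tested interval h and iteration count t.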



B. Proof of equation (16) 

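If the consensus matrix P is doubly stochastic it is straightforward to show that $P^t \to \frac{1}{n}\mathbf{1}\mathbf{1}^T$ as $t \to \infty$. Moreover, from standard 
Perron-Frobenius theory it is easy to show (see e.g., [?]) 

$$\left\|\frac{1}{n}\mathbf{1}^T - \left[P^t\right]_{i,:}\right\|_1 = 2\left\|\frac{1}{n}\mathbf{1}^T - \left[P^t\right]_{i,:}\right\|_{TV} \le \sqrt{n}\left(\sqrt{\lambda_2}\right)^t \tag{40}$$

so in our case $\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 \le \sqrt{n}\left(\sqrt{\lambda_2}\right)^{H_t-w}$. Next, demand that the right hand side bound is less than $\sqrt{n}\,\delta$ with $\delta$ to be 
determined: 

$$\sqrt{n}\left(\sqrt{\lambda_2}\right)^{H_t-w} \le \sqrt{n}\,\delta \implies H_t - w \ge \frac{\log(\delta^{-1})}{\log(\sqrt{\lambda_2}^{-1})}. \tag{41}$$

So with the choice $\delta^{-1} = \sqrt{n}\,T$, 

$$\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 \le \sqrt{n}\,\frac{1}{\sqrt{n}\,T} = \frac{1}{T} \tag{42}$$

if $H_t - w \ge \frac{\log(T\sqrt{n})}{\log(\sqrt{\lambda_2}^{-1})} = \hat{t}$. When w is large and $H_t - w < \hat{t}$ we simply take $\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 \le 2$. The desired bound of (16) 
is now obtained as follows 

$$\begin{aligned}
\left\|\bar{z}(t) - z_i(t)\right\|_* &\le \sum_{w=0}^{H_t-1}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 hL + 2hL \qquad (43)\\
&= \left(\sum_{w=0}^{H_t-\hat{t}-1}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1 + \sum_{w=H_t-\hat{t}}^{H_t-1}\left\|\frac{1}{n}\mathbf{1}^T - \left[P^{H_t-w}\right]_{i,:}\right\|_1\right) hL + 2hL\\
&\le \left(\sum_{w=0}^{H_t-\hat{t}-1}\frac{1}{T} + \sum_{w=H_t-\hat{t}}^{H_t-1} 2\right) hL + 2hL \qquad (44)\\
&= \frac{H_t-\hat{t}}{T}\,hL + 2\hat{t}hL + 2hL. \qquad (45)
\end{aligned}$$

Since $t \le T$ we know that $H_t - \hat{t} \le T$. Moreover, $\log(\sqrt{\lambda_2}^{-1}) \ge 1 - \sqrt{\lambda_2}$. Using these two facts we arrive at the result. 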



